Setup

rm(list=ls())
knitr::opts_knit$set(root.dir = '~/OneDrive - University College London/Projects/Ostridge_PanAf/baypass/analysing_baypass_output/baypass_aux/') 

Parameters

subsps=c('all', 'ce', 'w')
env_data='f_over_sum_known_trees'
habitat_col='black'
col_scale=colorRampPalette(c('goldenrod1', 'yellow4', 'darkgreen'))(100)

Here I plot the distributions of key statistics from running the BayPass AUX model with habitat classifications from Aleman et al. (2020) (using 4 categories) for all subspecies datasets.

Environmental data input files were made with ../../environmental_data/scripts/1.extract_environmental_and_behavioural_data.Rmd and ../../environmental_data/scripts/2.format_env_file.Rmd.

BayPass was run on myriad with scripts in myriad:analysis/phase1and2_exome_analysis/baypass/scripts/mapped.on.target/. BayPass was run three times with different seeds.

BayPass output files we formatted with ./scripts/format_baypass_aux_output.ipynb where SNP coordinates, gene annotations and results from multiple independent runs were added. Statistics with no suffix represent results from the focal run (e.g. ‘M_Beta’), and statistics from the the two repeat runs have the suffixes ‘.r1’ and ‘.r2’ (e.g. ‘M_Beta.r1’ and ‘M_Beta.r2’). Median values have also been calculated. I also add the results from random runs used to generate nulls.

Library

library(data.table)
options(datatable.fread.datatable=FALSE)
library(dplyr)
library(ggplot2)
library(VennDiagram)
library(psych)
library(tidyverse)
library(cowplot)
library(raster)

source("scripts/baypass_aux_tools.R")
source("../scripts/baypass_tools.R")
source("../baypass_core/scripts/candidate_allele_frequency_patterns.R")

Read in data

## - all 
## - ce 
## - w

Covariable Map

## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/harrisonostridge/Library/CloudStorage/OneDrive-UniversityCollegeLondon/Projects/PanAf/maps/chimp_ranges/chimp_ranges.shp", layer: "chimp_ranges"
## with 4 features
## It has 28 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/harrisonostridge/Library/CloudStorage/OneDrive-UniversityCollegeLondon/Projects/PanAf/maps/ne_10m_rivers_lake_centerlines/ne_10m_rivers_lake_centerlines.shp", layer: "ne_10m_rivers_lake_centerlines"
## with 1473 features
## It has 38 fields
## Integer64 fields read as strings:  ne_id 
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/harrisonostridge/Library/CloudStorage/OneDrive-UniversityCollegeLondon/Projects/PanAf/maps/africawaterbody/Africa_waterbody.shp", layer: "Africa_waterbody"
## with 833 features
## It has 6 fields
## Integer64 fields read as strings:  AF_WTR_ID

Review output files

Information comes from the 2019 BayPass manual.

BayPass output files we formatted with /Users/harrisonostridge/OneDrive - University College London/Projects/PanAf/phase1and2_exomes/baypass/baypass_aux/scripts/format_outputformat_baypass_aux_output.Rmd.

summary_Pdelta.out file

M_P and SD_P: “posterior mean and the standard deviation of the parameter P corresponding to the overall proportion of SNPs associated with each given covariable.”

  • aP and bP define the prior odds on P. M_P and SD_P correspond to the posterior estimates of P

I think M_P is very informative as it tells us how much evidence there is for association to a habitat across the genome.

Summary_betai.out file

M_Beta and SD_Beta: posterior mean and the standard deviation of the regression coefficient β.

PIP: posterior mean of the auxiliary variable.

BF(dB): estimated Bayes Factor in dB units (10×log10(BF)).

Here I plot the results from a single run.

## all

## ce

## w

Plot Median Values

From here on, I plot median values over three independent runs with different seeds to account for run-to-run variation.

Volcano Plot

NB: if there are ~0 SNPs associated with a habitat (i.e. the posterior mean of the parameter P is < 0.005), I do not include it in the graph as it messes up the axis scales.

BF

NB: An apparent excess of SNPs where BF=53 may occur because this is the maximum BF.

  • I think this is because if the posterior mean of the axillary variable (delta) is 1 or 0, then to account for the finite number T of MCMC sampled values, the posterior mean of delta is set to (T-0.5)/(T-1) or 0.5/(T-1) respectively. I think this is to account for the fact that we cannot be 100% certain an association is true (or not) as we haven’t sampled infinite MCMC runs?

BF distributions tend to reflect M_P estimates with higher M_P values corresponding to BF distributions shifted to higher values or with longer tails.

Beta

NB: The original BayPass paper describes beta > 0.2 as very strong associations and betas < 0.1 as weak associations.

NB: The density distribution of beta values is shown on a log scale (to account for the vast excess of sites with beta ≈ 0) and so differences between habitat distributions are more pronounced than they might first appear.

Plot

## all
## Warning in max(betai[[y]]): no non-missing arguments to max; returning -Inf
## Warning in min(betai[[y]]): no non-missing arguments to min; returning Inf
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis
## ce
## Warning in max(betai[[y]]): no non-missing arguments to max; returning -Inf
## Warning in min(betai[[y]]): no non-missing arguments to min; returning Inf
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis

## w
## Warning in max(betai[[y]]): no non-missing arguments to max; returning -Inf
## Warning in min(betai[[y]]): no non-missing arguments to min; returning Inf
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis

  • there is a clear skew towards negative beta values in all samples and central-eastern.

Distribution Across the Genome

Correlation with XtX

## all

## ce

## w